---
title: Evaluation Concepts and Overview
---

<div style={{
    position: 'relative',
    paddingBottom: '56.25%', // 16:9 aspect ratio
    height: 0,
    overflow: 'hidden',
    maxWidth: '100%',
    marginBottom: '20px'
}}>
    <iframe
        src="https://www.loom.com/embed/a0fadd90c46d41a6b0d8816e1b924b95?sid=17e17c65-f335-486c-9f59-b8ac33eced3f"
        frameborder="0"
        webkitallowfullscreen
        mozallowfullscreen
        allowfullscreen
        style={{
            position: 'absolute',
            top: 0,
            left: 0,
            width: '100%',
            height: '100%',
        }}
    />
</div>

## Understanding LLM Evaluation with Opik

This video introduces the fundamentals of LLM [evaluation](https://www.comet.com/docs/opik/evaluation/overview) and explains why it differs from traditional machine learning [metrics](https://www.comet.com/docs/opik/evaluation/metrics/overview). Conventional ML evaluation relies on scores such as accuracy and F1. LLM outputs are free-form text, so [evaluation](https://www.comet.com/docs/opik/evaluation/overview) has to assess qualities such as relevance, accuracy, and helpfulness. You'll learn about Opik's three-component evaluation framework and see how it supports quantitative performance measurement across hundreds of test cases.
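
To make the framework concrete, here is a minimal plain-Python sketch of its three components: a dataset, a metric, and an experiment. It deliberately does not use the Opik SDK. Every name in it (`DATASET`, `contains_expected`, `fake_llm_app`, `run_experiment`) is a hypothetical stand-in chosen for illustration, and the "LLM application" is a canned lookup rather than a real model call.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable

# Dataset: example inputs paired with expected outputs (illustrative items).
DATASET = [
    {"input": "What is the capital of France?", "expected_output": "Paris"},
    {"input": "What is the capital of Japan?", "expected_output": "Tokyo"},
    {"input": "What is the capital of Italy?", "expected_output": "Rome"},
]


# Metric: an automated scoring method. This simple heuristic checks whether
# the expected answer appears anywhere in the generated text.
def contains_expected(output: str, expected: str) -> float:
    return 1.0 if expected.lower() in output.lower() else 0.0


# Stand-in for an LLM application (assumption: a real task would call a model).
def fake_llm_app(question: str) -> str:
    canned = {
        "What is the capital of France?": "The capital of France is Paris.",
        "What is the capital of Japan?": "Tokyo is the capital of Japan.",
    }
    return canned.get(question, "I am not sure.")


@dataclass
class ExperimentResult:
    name: str
    scores: list[float]

    @property
    def mean_score(self) -> float:
        return mean(self.scores)


# Experiment: one evaluation run of a task against a dataset with a metric.
def run_experiment(
    name: str,
    task: Callable[[str], str],
    dataset: list[dict],
    metric: Callable[[str, str], float],
) -> ExperimentResult:
    scores = [metric(task(item["input"]), item["expected_output"]) for item in dataset]
    return ExperimentResult(name=name, scores=scores)


result = run_experiment("baseline", fake_llm_app, DATASET, contains_expected)
print(result.name, result.scores, round(result.mean_score, 2))
```

The per-item scores and their mean are what turn a handful of spot checks into a number that can be tracked from one run to the next.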

## Key Highlights

- **Beyond Traditional [Metrics](https://www.comet.com/docs/opik/evaluation/metrics/overview)**: LLM [evaluation](https://www.comet.com/docs/opik/evaluation/overview) requires new approaches since outputs are text that must be assessed for qualities like relevance, accuracy, and helpfulness
- **Three-Component Framework**: Opik's [evaluation](https://www.comet.com/docs/opik/evaluation/overview) system consists of [datasets](https://www.comet.com/docs/opik/evaluation/manage_datasets) (example inputs/outputs), [metrics](https://www.comet.com/docs/opik/evaluation/metrics/overview) (automated scoring methods), and [experiments](https://www.comet.com/docs/opik/evaluation/evaluate_your_llm) (evaluation runs)
- **Comprehensive [Dataset](https://www.comet.com/docs/opik/evaluation/manage_datasets) Management**: Collections of example inputs and expected outputs that represent your specific use cases and requirements
- **Flexible [Metrics](https://www.comet.com/docs/opik/evaluation/metrics/overview) System**: From simple [heuristics](https://www.comet.com/docs/opik/evaluation/metrics/heuristic_metrics) to sophisticated [LLM-as-a-judge](https://www.comet.com/docs/opik/evaluation/metrics/g_eval) approaches for automated output scoring
- **Systematic Experimentation**: Each [experiment](https://www.comet.com/docs/opik/evaluation/evaluate_your_llm) represents a specific LLM application configuration tested against [datasets](https://www.comet.com/docs/opik/evaluation/manage_datasets) with defined [metrics](https://www.comet.com/docs/opik/evaluation/metrics/overview)
- **Model Comparison Power**: Compare different models (GPT-3.5 vs Claude 3.5 vs GPT-4 vs Gemini) systematically on the same [datasets](https://www.comet.com/docs/opik/evaluation/manage_datasets)
- **[Prompt](https://www.comet.com/docs/opik/prompt_engineering/prompt_management) Template Testing**: [Evaluate](https://www.comet.com/docs/opik/evaluation/evaluate_prompt) various prompt templates against [datasets](https://www.comet.com/docs/opik/evaluation/manage_datasets) with specific models for optimization
- **Quantitative Decision Making**: Replace subjective judgments based on a few examples with quantitative measurement across hundreds of test cases
- **Production Confidence**: A structured [evaluation](https://www.comet.com/docs/opik/evaluation/overview) approach provides confidence before deploying LLM applications to production
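
The comparison idea in the highlights above (several models or prompt templates scored on the same dataset) can be sketched as follows. This is an illustrative plain-Python example under stated assumptions, not Opik's API: the configuration names, the canned answers standing in for model outputs, and the helpers `exact_match` and `compare` are all hypothetical.

```python
from statistics import mean

# Shared dataset: every configuration is scored on the same items.
DATASET = [
    {"input": "2 + 2", "expected_output": "4"},
    {"input": "3 * 3", "expected_output": "9"},
    {"input": "10 - 7", "expected_output": "3"},
    {"input": "8 / 2", "expected_output": "4"},
]

# Canned answers standing in for different models or prompt templates
# (assumption: a real run would call each configuration's model instead).
CONFIG_OUTPUTS = {
    "config_a": {"2 + 2": "4", "3 * 3": "9", "10 - 7": "3", "8 / 2": "4"},
    "config_b": {"2 + 2": "4", "3 * 3": "6", "10 - 7": "3", "8 / 2": "16"},
    "config_c": {"2 + 2": "4", "3 * 3": "9", "10 - 7": "3", "8 / 2": "4.0"},
}


# Heuristic metric: 1.0 only when the output matches the expected text exactly.
def exact_match(output: str, expected: str) -> float:
    return 1.0 if output.strip() == expected else 0.0


def compare(config_outputs: dict, dataset: list[dict]) -> list[tuple[str, float]]:
    """Score each configuration on the dataset and rank by mean score."""
    results = []
    for name, outputs in config_outputs.items():
        scores = [
            exact_match(outputs[item["input"]], item["expected_output"])
            for item in dataset
        ]
        results.append((name, mean(scores)))
    # Highest mean score first.
    return sorted(results, key=lambda pair: pair[1], reverse=True)


leaderboard = compare(CONFIG_OUTPUTS, DATASET)
for name, score in leaderboard:
    print(f"{name}: {score:.2f}")
```

Because the dataset and metric are held fixed, any difference in mean score can be attributed to the configuration itself. The strict `exact_match` metric also shows why metric choice matters: `config_c` loses a point for answering `4.0` instead of `4`, which a more lenient metric might accept.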
